# Multimodal Feature Extraction

Openvision Vit Base Patch16 384

OpenVision is a fully open, cost-effective family of advanced vision encoders focused on image feature extraction in multimodal learning.

Multimodal Fusion
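The model names in this list appear to follow the usual ViT naming convention, in which `patch16` and `384` denote the patch size and the square input resolution in pixels. Under that assumption, the short sketch below shows how many patch tokens such an encoder would produce per image. The helper name `patch_grid` is hypothetical, and the count excludes any class or register tokens a particular model may add.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches_per_side, total_patches) for a square input image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side, side * side


# Sizes read off the model names in this list (an assumption about the naming).
for label, size, patch in [
    ("patch16 @ 384", 384, 16),
    ("patch14 @ 448", 448, 14),
    ("patch14 @ 224", 224, 14),
]:
    side, total = patch_grid(size, patch)
    print(f"{label}: {side} x {side} = {total} patches")
```

Higher resolution or a smaller patch size increases the token count quadratically, which is the main cost trade-off when choosing between encoders of this kind.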

Mlcd Vit Bigg Patch14 448

MLCD-ViT-bigG is an advanced Vision Transformer model enhanced with 2D Rotary Position Encoding (RoPE2D), excelling in document understanding and visual question answering tasks.

Text Recognition
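To give a feel for what a 2D rotary position encoding does, here is a minimal sketch of one common axial formulation: half of the channels are rotated according to the patch row and the other half according to the patch column. This is not taken from the MLCD code; the channel split, the base of 10000, and the function names `rope_1d` and `rope_2d` are assumptions made for illustration.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by angles proportional to `pos`."""
    assert len(vec) % 2 == 0
    pairs = len(vec) // 2
    out = []
    for i in range(pairs):
        theta = pos * base ** (-i / pairs)  # lower frequency for later pairs
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out += [x * c - y * s, x * s + y * c]
    return out


def rope_2d(vec, row, col):
    """Encode the row in the first half of the channels, the column in the second."""
    assert len(vec) % 4 == 0
    half = len(vec) // 2
    return rope_1d(vec[:half], row) + rope_1d(vec[half:], col)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy query/key vectors (illustrative values only).
q = [0.5, -1.0, 0.25, 2.0, 1.5, 0.0, -0.75, 1.0]
k = [1.0, 0.5, -0.5, 0.25, 0.0, 2.0, 1.0, -1.0]

# Both pairs of positions have the same relative offset (+2 rows, +3 columns).
score_a = dot(rope_2d(q, 1, 2), rope_2d(k, 3, 5))
score_b = dot(rope_2d(q, 4, 7), rope_2d(k, 6, 10))
print(abs(score_a - score_b) < 1e-9)
```

The two attention scores should agree up to floating-point error, because rotary encodings make the query-key dot product depend only on the relative row and column offset, not on absolute position.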

Internvit 300M 448px V2 5

InternViT-300M-448px-V2_5 is a major upgrade of InternViT-300M-448px. It improves visual feature extraction through ViT incremental learning with NTP (next-token prediction) loss, and is particularly strong on multilingual OCR data and complex inputs such as mathematical charts.

Coin Clip Vit Base Patch32

A coin image retrieval model fine-tuned from CLIP to improve feature extraction for coin images.

Eva02 Large Patch14 224.mim M38m

EVA02 feature/representation model, pretrained on Merged-38M dataset via masked image modeling, suitable for image classification and feature extraction tasks.

Image Classification

Taiyi CLIP RoBERTa 326M ViT H Chinese

The first open-source Chinese CLIP model, pre-trained on 123 million image-text pairs, with a text encoder based on the RoBERTa-large architecture.

Transformers Chinese

Taiyi CLIP Roberta Large 326M Chinese

The first open-source Chinese CLIP model, pre-trained on 123 million image-text pairs. It supports Chinese image-text feature extraction and zero-shot classification.

Transformers Chinese
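Zero-shot classification with a CLIP-style model generally works by embedding the image and one text prompt per candidate label, then taking a softmax over the scaled cosine similarities. The sketch below illustrates only that scoring step. The embeddings are toy placeholders rather than outputs of any Taiyi model, the function name `zero_shot_probs` is hypothetical, and the scale of 100 is an assumption based on the logit scale commonly used by CLIP models.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_probs(image_emb, text_embs, scale=100.0):
    """Softmax over scaled cosine similarities between an image and each label."""
    logits = [scale * cosine(image_emb, emb) for emb in text_embs.values()]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(text_embs, exps)}


# Toy placeholder embeddings; with a Chinese CLIP model the prompts
# behind these labels would be written in Chinese.
text_embs = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "bird": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.2, 0.1]

probs = zero_shot_probs(image_emb, text_embs)
best = max(probs, key=probs.get)
print(best, round(probs[best], 4))
```

In practice the two embedding dictionaries would come from the model's image and text encoders; no label-specific training is involved, which is what makes the classification zero-shot.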

Taiyi CLIP Roberta 102M Chinese

The first open-source Chinese CLIP model, pre-trained on 123 million image-text pairs, with a text encoder based on RoBERTa-base architecture.

Transformers Chinese


© 2025 AIbase